Computer Science Journal of Moldova, vol. 19, no. 2(56), 2011

Romanian Linguistic Resources On Very Large Scale

Dan Cristea

1 Introduction

This paper suggests a methodology for building a technological environment for linguistic processing, intended to conserve, update and exploit, for research, for public and for commercial purposes, strategic linguistic resources of the Romanian language, rooted in textual data contributed daily and in the long run by important editorial houses and mass-media institutions. In essence, it describes a technology able to receive, store and continuously process large amounts of textual data, received from voluntary contributors, on a daily basis. Apart from storing linguistic data à la longue for the benefit of preserving the language, the results of the processing will be returned to three categories of users: the researchers working on Romanian language and computational linguistics, the contributors of the resources, and the public at large.

Such an initiative is motivated by the growing need for linguistic resources, including textual data and processing tools, which is manifested in social sciences and humanities, and which should bring the Romanian language¹, now still less-resourced, to the level of the technologically rich languages of Europe. Raising the quantity of resources dedicated to different languages has been a constant preoccupation in Europe over the past 15 years², triggered by the necessity to boost the language industry to the level that makes text and speech fully machine-interpretable media, in more and more complex applications.

©2011 by D. Cristea
¹ Spoken by 28 000 000 people (of which 23 501 683 native), according to the Latin Union report http://www.ethnologue.com/showlanguage.asp?code=ron
² Here are some of the most important projects and coalitions aimed at building linguistic resources and processing capacities, financed by the EC, from 1995 until

The solution proposed in this paper will not only
satisfy today's needs, but should also lay the foundation for a continuous observation of the language in its evolution. Indeed, the permanent change of the language (the Romanian of today is no longer the same as that spoken or written in the middle years of the previous century, and this happens, to a greater or lesser extent, to all living languages) makes language resources extremely volatile, as they become obsolete very quickly. As the language evolves, and sometimes our vision with respect to linguistic phenomena changes, the resources themselves get old. Language resources should be kept aligned with the language evolution and the continuous update of the theoretical and computational views on language.

The proposal also fulfils important targets in the direction of language monitoring. Keeping a language under surveillance can be compared with monitoring a volcano which manifests some activity from time to time. Just as a volcano must be kept under strict observation, with different physical and chemical parameters continuously recorded and interpreted in order to signal possible eruptions, thus preventing harm to the population before it is too late, a multitude of features of a language can be recorded and the direction of its evolution can be identified. Significant events in the evolution of a language have to be signalled, such as the acquisition of new words, new expressions, or the emergence of new senses. Tendencies must be perceived, such as a possible invasive influence from a foreign language, caused by its exaggerated exposure on public channels (TV, social web, etc.). If these factors are noticed and signalled in time by a specialised service, then the decision whether and when to act remains the attribute of the appropriate decision makers (Academia, mass-media organisations or the university education system), in order to preserve the language and to keep its original spirit alive.

The scientific knowledge on language processing has reached
today an advanced level of competence internationally, also doubled by notable technological performances. Languages that are today most vividly supported theoretically and technologically do make use of rich collections of linguistic resources, continuously updated. These resources include corpora, i.e. collections of texts in original form, but also texts annotated by experts, reflecting human competence over linguistic phenomena, which can be incorporated, through learning mechanisms, into automated systems. As resources are badly needed and many of them are very expensive, the issue of acquiring them should cease being episodic and must be driven by a national determination. Our initiative reflects the point of view that the linguistic resources of the languages spoken in a country should be considered of national interest.

² (continued) today: TELRI, MULTEXT, MULTEXT goes EAST, EUROWORDNET, BALKANET, LT4eL, CLARIN, FlareNet, and the ongoing METANET with its 4 satellite projects: T4ME, METANET4U, CESAR and META-NORD.

2 Previous work

The proposal advanced in this paper is scientifically supported by a number of research achievements in the field of Romanian language technology and resources.

ALPE (Automated Linguistic Processing Environment) (Cristea and Butnariu, 2004; Cristea and Pistol, 2008; Pistol, 2011) is a theoretical framework for organising and exploiting the annotation added to texts. A hierarchical organisation of a universe of annotation schemas and processing links makes possible the design of complex linguistic processing workflows. The model is intended to increase human skills to design linguistic processing tasks, to manipulate and re-use resources and tools, covering expertise levels that range from the expert down to the novice. The ALPE philosophy can stay at the base of the processing capabilities of the Portal, the storing and processing entity described below, featuring intelligent
human-computer interaction capabilities.

The elaboration of eDTLR, the electronic form of the Thesaurus Dictionary of Romanian Language, where approximately half of its tremendous number of citations (about 1.3 million) have been linked to the original scanned books (Cristea and Raschip, 2008; Cristea et al., 2009; Haja and Cristea, 2010), has opened a huge field of investigation in Romanian lexicography and computational linguistics. The inventory of almost all words that have been used in the language in written form since the first known documents in Romanian, the extremely fine collection of word senses and their associated definitions, the high number of citations selected for each sense of each word, ordered chronologically, the indication of the etymological sources, and, as remarked, the associated sources, in scanned and OCR form, make eDTLR one of the richest sources of linguistic data for Romanian language.

Finally, there are also other outstanding resources for Romanian language: DEX (the Explanatory Dictionary of Romanian Language) with its online form³, Romanian FrameNet (Trandabat, 2010; for English FrameNet, see Fillmore et al., 2002), Romanian VerbNet (Moruz, 2010; for English VerbNet, see Kipper-Schuler, 2005), Romanian WordNet (Tufis et al., 2004; for the English WordNet, see Fellbaum, 1998), etc. These resources are more or less complete, but even when they reach a satisfactory coverage of the language there will still remain the need to keep them updated with the evolution of language. The dynamics of language have to be mirrored in the strategic resources of the language as well.

In (Cristea, 2010), the idea of promoting a legislative initiative was investigated, one that would impose on the producers of written texts (called resourcers in the paper: editing houses, recording houses, studios, etc.) the obligation to donate their linguistic resources in electronic form for the benefit of language research, without, this way,
inducing any harm to them (producers or authors) as, for instance, induced by weakening their property control over the resources, or by commercial losses. The proposal advanced in this paper prepares the field for such a large scale implementation of the daring initiative to acquire language resources and to process them continuously, by considering only voluntary donations of publications in electronic form (mainly books, journals, magazines, newspapers and web publications).

³ www.dexonline.ro

3 The Portal and its Repository

Technologically, the enterprise of sustaining a continuous flow of linguistic data can be fulfilled by a platform (let's call it Portal) capable of receiving, storing, processing and making accessible to researchers, the public and the language industry large amounts of linguistic data on Romanian language.

The storage section of the Portal (let's call it Repository) basically includes the following three important types of resources:

A) original documents: linguistic data in electronic form contributed by the voluntary donors (let's call them resourcers). The originals of these data are usually distributed on paper and/or in electronic form on the culture and mass-media channels by the resourcers. In order to be easily retrieved, the source files need to undergo a series of transformations, including classification, indexing, statistical processing, archiving, assignment of persistent addresses, etc.

B) a set of representative, specialised and diachronic corpora of Romanian language, continuously updated. These are documents selected from the Repository by applying rigorous rules of corpus consistency and balance (Sinclair, 1996), and covering a long period in time. Part of these corpora will be automatically annotated (minimally, for token, lemma and part-of-speech) and metadata will be generated according to generally accepted standards, such as TEI (Burnard and Sperberg-McQueen, 1995).

C) a
collection of synchronised Romanian linguistic thesauri, intended to be offered to researchers in the fields of social sciences, humanities and computational linguistics, kept updated out of the components A and B of the Repository. Minimally, there are: eDTLR, RoWordNet, RoVerbNet, and RoFrameNet. The synchronisation should be assured by automatic procedures capable of signalling the occurrence in the Repository of pieces of language data that would fit as adequate entries in these resources.

Sections B+C make up the collection of strategic resources.

A significant library of web services, procedures that could be called online to perform elementary language processing tasks, facilitates access to the Repository. A set of processing modules needs to be integrated in an ALPE-like hierarchy, based on their input-output XML signatures. Although the idea from which ALPE emerged was published 7 years ago (Cristea and Butnariu, 2004), a thorough theoretical model has only recently been finalised (Pistol, 2011) and a decisive implementation still waits to be realised.

4 Designing the functionality of the Portal

The realisation of the Portal should be based on a solid technological infrastructure. Moreover, the enterprise should be sustained by a coalition of researchers, from the Romanian-speaking area as well as from outside it, able to communicate and work together. The formation of alliances with similar initiatives in Europe should not be neglected, in order to envisage exchange of data, synchronisation of the data formats of web services, creation of protocols for complex processing flows involving more languages, annotation standards and adoption of a unified processing framework (such as ALPE, for instance).

As for the Portal itself, the design of its functionalities should address communication, storing and processing.

The communication section addresses the exchange of information between the Portal and the community of users and third parties. The design of this
section should see the Portal as a factory that processes words. On one side, in input, the raw textual material is received from the resourcers, voluntary contributors of the resources, and on the other side, in output, the consumers should be served. These are social sciences and humanities (SSH) researchers (in need of corpora, of aligned dictionaries and of sophisticated access to the strategic resources), the contributors themselves (in need of services that would allow them to raise their profit, a kind of reward for the data they offer), and the general public (usually browsing the primary textual data and dictionaries for contexts of occurrence, linguistic and cultural knowledge, statistical data on the language, etc.).

The rapid accumulation of resources on the Portal imposes a careful organisation of the storage section, the Repository. A farm of storage devices should be kept running continuously, on which efficient indexing algorithms should be used. To prevent data losses due to unexpected events (disasters, technical failures) a redundant storage architecture and mirroring techniques must be used.

A preliminary estimation of the hardware support needed for storage, based on data acquired by Serediuc (2010), shows that a capacity of 1 PB would be sufficient to host the electronic format of all books printed in Romania for a period of one century, including safety and backup.

It is important to note that the textual data should be recorded not only in their original format. Since the main intent is to allow targeted searches in the collection of text material of the Repository (including metadata, words and compounds, expressions, parts of speech, named entities, time-anchored events, semantic relations, collocations, frequent terms and n-grams, syntactic structures, first or last known occurrences, absolute and relative frequencies, etc.), techniques to represent textual data in complementary form to its original
string format, including indexes and XML-based formats, shall be used. In this respect, the technology of Word Sketch (Kilgarriff et al., 2004; Macoveiciuc and Kilgarriff, 2010), which uses additional annotation attached to the word form, could be inspiring.

In contrast with other massive initiatives for storing and processing textual data, which obtain the character strings of the primary data after processing the paper format (such as the Google Books initiative⁴ or the Gutenberg project⁵), my proposal takes into consideration only accurate textual data, therefore clean texts. This is because texts will be contributed by the editing houses that own the original data, thereby avoiding the scanning process followed by the transformation of the images into character strings by OCR.

⁴ "History of Google Books", http://books.google.com/intl/en/googlebooks/history.html
⁵ http://www.gutenberg.org/

The processing section represents the backbone of the Portal. An ALPE-like framework can implement the interoperability of basic components. Each document, once placed on the Portal, should be submitted to a processing chain that includes, minimally: tokenization, part-of-speech tagging, lemmatization and indexing. This makes it necessary that each raw text document be paired with a standoff XML annotation referring to it.

The ALPE framework will combine the basic functionalities mentioned above with other higher-level functions that could be triggered by more advanced applications. These can include: syntactic parsing, segmentation at sentence and clause boundaries, identification of noun and verb phrases, anaphora resolution, discourse parsing, summarisation, etc.

5 Keeping the data aligned and updated

A number of resources, which are considered of strategic importance in keeping a language technologically fresh, have to be continuously interconnected on the Repository. Some of the most significant of these resources are: the electronic version of the Thesaurus
Dictionary of Romanian Language (eDTLR), the Romanian FrameNet (RoFrameNet) and the Romanian VerbNet (RoVerbNet). None of them is finalised, but even when this happens, they should be kept updated with the evolution of language. In this section we propose a methodology that allows them to mirror the significant changes that the language could undergo. It is obvious that changes usually manifest slowly, and often there are controversies whether a certain linguistic or syntactic tendency should be recorded as being accommodated by the language. The same academic institutions will continue to take the final decisions, the way they do now, but the technology can help them to better monitor the language, to track frequencies of occurrences, to detect first uses or declining usage.

We believe that each lexical item of the language must be represented by a word file on the Repository, and this record should include references in all the strategic resources. As such, to take an example, the record of a verb v is linked to its corresponding entry in eDTLR, where the inventory of senses is recorded, and these senses are aligned to the corresponding entries in RoVerbNet and RoFrameNet.

The Dictionary, in its paper format, concentrates a significant and exquisite effort of the Romanian Academy over more than a century (the last volume appeared in 2010). Originally printed in 36 volumes (19 in the anastatic edition (Academy, 2010)), it contains more than 15,000 pages and about 175,000 entries, with citations collected from about 4,000 volumes of the written Romanian literature. The electronic version (eDTLR) was created during the years 2007-2010⁶. The entries are XML codified conforming to TEI P5⁷.

Some of the most significant functions of the Portal are the dynamic discovery of contexts in the input documents (with or without the recognition of word senses), signalling of new words, signalling of new senses, signalling of
obsolete words and senses, identifying the lexical entry in citations, etc. The process which should be placed at the base of recognizing new words and senses, as well as obsolete words and senses, presupposes placing a bag of words under constant surveillance. These are words/senses likely to become popular or, on the contrary, to become under-used. If we take the example of words manifesting a constantly degrading frequency, let us note that the criterion of absolute or even relative frequency, over a certain time interval, could prove not to be relevant, because there are words which are very rarely used, although not in danger of extinction (some science neologisms, for instance). The best way to do this is to associate with each word an individual word file, recording a set of dynamic features, among which the frequency of occurrence over time in specific registers (plots, out of which relative frequencies and gradients of deterioration over a constant interval of time, considered always back from the current day, could be computed).

It is evident that eDTLR already contains many features that make it the perfect host of word files. Nevertheless, its entries should certainly be revised in order to transform eDTLR into the repository of word files. Due to its particular importance in the panoply of Romanian resources, a special section of the Repository should be dedicated to eDTLR.

⁶ Project no. 91013/18.09.2007 under National Programmes II, section D, of the Romanian Ministry of Education, Research and Youth.
⁷ http://www.tei-c.org/release/doc/tei-p5-doc/en/html/

On the other hand, FrameNet and VerbNet are resources initially built for English but which have received a lot of attention worldwide because of their ability to capture the semantic structure and syntactic constraints of verbal constructions. Linking these resources to eDTLR involves finding specific semantic structures fitted for each word in the dictionary (usually, words triggering
events, such as verbs or predicative nouns). At UAIC-FII, a number of tools intended to align different pieces of the strategic resources of the Repository have already been designed⁸. What needs to be done is to synchronise the updating process with the flow of data continuously uploaded onto the Portal.

6 Addressing the users' needs

The technological apparatus described in this paper is useless if considered in isolation from the needs of researchers, resourcers and ordinary citizens. In this section I will discuss the needs of all these categories of users, in turn.

We have in mind different categories of researcher users, from experts in computational linguistics and NLP to SSH researchers, usually considered rather unskilled in the use of IT technologies. This last category may include social scientists, archaeologists, historians, geographers, linguists, lexicographers, etc., all in need of working on textual data with sophisticated NLP technology. In a recently ended project⁹, Cristea et al. (2011) designed the lines of behaviour of an intelligent help desk addressing the needs of different categories of researchers working with textual material, from experts in NLP and IT down to novices. A configurable interface should allow easy interaction for all these categories.

⁸ For instance, links connecting eDTLR to RoWN (Ivanescu, 2011), RoFrameNet (Trandabat, 2010), and RoVerbNet (Moruz, 2011).
⁹ CLARIN, EC FP7 project no. 212230.

In a basic scenario, a researcher may access the Portal to ask for some type of processing on a file stored on the Repository, by following a dialogue with the Help Desk interface. In a more sophisticated scenario, the user could be interested in uploading her/his own file and initiating a processing chain using the language technology available on the Portal. The dialogue component of the interactive interface guides her/him through a design process that should end up in putting together disparate processing
modules (operating on the Portal or of which the Portal is aware, although they operate remotely), thus configuring a complete or only a partial solution to the problem at hand.

Another important user type is the provider of linguistic material, the resourcer (Cristea, 2010). The services addressing this user type should be, as much as possible, free, as a reward for their offer to donate linguistic data to the Portal. Examples of services addressing the resourcers are: advertising of books printed by the editing houses, search facilities to browse electronic versions of books, annotation and retrieval of citations, automatic summarisation, authors' indexing, statistics and plots regarding number of accesses, search criteria, time, location, etc. More elaborate demands, addressing cultivated readers, may include: search for exemplification of certain syntactic structures, occurrence of different types of semantic relations, collocations, frequent terms and n-grams in books, search by named entities of public interest, including VIPs, in different sources, automatic generation of genealogical trees of characters featuring in fiction or chronicles, tracing geographical journeys in travel and historical books, etc. Many more suggestions of services made possible by sophisticated NLP technology can be collected from the volunteer resourcers and, eventually, be implemented on the Portal.

Finally, the third type of user is the public at large. We include here categories such as school children and university students, people learning Romanian for the first time or trying to improve their foreign language skills in the Romanian language, Romanian natives in search of definitions, orthography rules, contexts of occurrence, old and contemporary usage of words, but also IT companies interested in accessing the NLP technology offered by the Portal programmatically, for instance to include in their authored software APIs or NLP library codes. A significant library of
Web services should allow users to access the resources on the Repository, after they have been processed by the Portal, for goals other than research. A business model describing the exploitation of the resources on the Repository should be defined for this purpose.

7 Business model and IPR issues

The formation of a coalition of resourcers, potential partners expected to agree to contribute their resources to the Portal on a continuous basis, is of extreme importance. Without such contributions, the whole skeleton depicted in this visionary paper would prove weak and illusory. Bringing the owners of resources, mainly editorial houses, on board means engaging in a process of persuasion with them. A Memorandum of Understanding (MoU) should be signed with each possible partner, with the intention to harmonise her/his interests with those of the project and stating clearly that neither the printing houses' nor the authors' IPRs over the texts will be affected. The main message delivered by the MoU is to discourage the partners' presumption that the alliance with the project would be potentially harmful for them. On the contrary, they should feel that, by entering into this coalition, incredible possibilities to exploit their data will be offered by the new technologies of intelligent text processing.

The contributor of the resource is in the position to decide whether the access of the end user should be free or bound to certain commercial restrictions. The general policy to be adopted is that if a fee is imposed on the public area use then it should mainly return, as a benefit, to the owner of the corresponding resources. In any business model, the Portal should keep for itself only that part of the fee that would allow it to support its expenses. After an initial installation, the Portal should work on a self-sustained basis. The business model thus becomes essential for the long life of the enterprise.

From a business model point of view, two main types of uses can be
imagined, depending on whether or not the user has a commercial interest. The general public, being driven by a non-commercial interest in exploiting the data on the Portal, should, in principle, be exempted from fees. Commercial access, for the benefit of organisations and companies needing NLP technologies for all kinds of applications, should on the contrary be paid, and the benefits should go to the contributors of the resources, as mentioned above.

The other very important issue in collecting resources stays in not [...] They should become aware that none of their IPR are affected.

8 Conclusions

It is the author's belief that implementing the proposals advanced in this paper will boost the language processing capacities for Romanian language to a level compatible with the technologically most advanced languages in Europe.

I am aware that the sophisticated technology described in this paper cannot be realised without a well-funded basis. Financial and human resources able to build the Portal and make it work should be looked for and deployed. A proper awareness campaign that would sensitise responsible political decision makers and stakeholders towards accomplishing this goal also needs to be launched.

The project, if financed, will boost the interest of the Romanian-speaking people in applications based on advanced natural language technologies. It will open to researchers in SSH and computational linguistics and to the public an incredible amount of linguistic data, to be exploited for research on language and on social aspects that can be reflected in language. It will also mean a tremendous step forward for that part of the industry which is oriented towards natural language applications and the semantic web, more and more keen to bring natural language to mobile and desktop interfaces.

The technological background described in this paper will facilitate the acquisition of the much wanted strategic resources
for Romanian language and will open the gate towards keeping them aligned with the language evolution.

A recent initiative of the META-NET consortium finalised the drafting of a series of Language White Papers for 30 European languages. The META-NET Language White Paper series "Languages in the European Information Society"¹⁰ reports on the state of each European language with respect to Language Technology and explains the most urgent risks and chances. Summarising tables¹¹ indicate cluster-based rankings of all languages in four sample areas: text analysis, speech, machine translation, and resources. In both areas of text analysis and resources Romanian is shown as belonging to Cluster 4 (out of 5), meaning medium support: "research prototypes/resources exist, but quality and coverage varies"¹².

Finally, and not least, such a wide enterprise of collecting resources cannot be realised without the consent of the owners of texts to donate their data. A recent investigation of a few of the most important producers of printed information in Romania revealed that many editing houses are keen to donate their resources for research purposes, provided they gain the confidence of not being exposed to any commercial or property damage. Gaining the owners' trust that, on the contrary, donating linguistic data means thriving, not losing, is still a task ahead of us.

Acknowledgments

I am grateful to the ICT-PSP projects of the European Commission ATLAS (http://www.atlasproject.eu/) and METANET4U (http://metanet4u.eu/) for supporting part of the work described in this paper, and to my colleagues Lucian Gâdioi, Adrian Iftene and Diana Trandabat for their contributions to the elaboration of a project proposal in line with the objectives described in this paper.

¹⁰ Downloadable from http://www.meta-net.eu/intranet/language-whitepapers/files-for-publication/whitepapers
¹¹ These statistics, very fresh at the moment of drafting this paper, are not yet made
public.
¹² Where Cluster 1, "excellent LT support: technologies/resources exist that are in widespread use and cover practically all linguistic phenomena (vocabulary, compounds, grammar, metaphors etc.) of a language", shows no entries, and Cluster 5 means "low to almost no support: from the drawing board to rudimentary prototypes, very limited quality and coverage, toy systems".

References

*** (2010) Dictionary of the Romanian Language (in Romanian), anastatic edition, following Dictionary of the Romanian Language (DA) and Dictionary of the Romanian Language (DLR), Romanian Academy Printing House, Bucharest.

Burnard, L., Sperberg-McQueen, C. M. (1995) The Design of the TEI Encoding Scheme, Computers and the Humanities, 29 (1).

Cristea, D. (2010) Very large language resources? At our finger! In Proceedings of the Workshop Language Resources: From Storyboard to Sustainability and LR Lifecycle Management, LREC-2010, Valletta.

Cristea, D. and Butnariu, C. (2004) Hierarchical XML representation for heavily annotated corpora. In Proceedings of the LREC 2004 Workshop on XML-Based Richly Annotated Corpora, LREC-2004, Lisbon.

Cristea, D., Pistol, I. (2008) Managing Language Resources and Tools Using a Hierarchy of Annotation Schemas. Proceedings of the Workshop on Sustainability of Language Resources, LREC-2008, Marrakech.

Fillmore, C. J., Baker, C. F., and Sato, H. (2002) The FrameNet database and software tool. In Proceedings of the Third International Conference on Language Resources and Evaluation, Las Palmas.

Ivanescu, M.-L. (2011) Automatic Techniques for filling in the Romanian WordNet out of eDTLR (in Romanian), graduation paper, "Alexandru Ioan Cuza" University, Faculty of Computer Science, Iasi.

Kilgarriff, A., Rychly, P., Smrz, P. and Tugwell, D. (2004) The Sketch Engine, Proc. Euralex, Lorient.

Kipper-Schuler, K. (2005) VerbNet: A broad-coverage, comprehensive verb lexicon. Ph.D. thesis, Computer and Information
Science Dept., University of Pennsylvania, Philadelphia, PA.

Macoveiciuc, M. and Kilgarriff, A. (2010) The RoWaC Corpus and Romanian Word Sketches. In: Multilinguality and Interoperability in Language Processing with Emphasis on Romanian. Edited by Dan Tufis and Corina Forascu, Romanian Academy Publishing House, Bucharest.

Moruz, M. A. (2011) Predication Driven Textual Entailment, Ph.D. thesis, "Alexandru Ioan Cuza" University, Faculty of Computer Science, Iasi.

Pistol, I. (2011) The Automated Processing of Natural Language, Ph.D. thesis, "Alexandru Ioan Cuza" University, Faculty of Computer Science, Iasi.

Serediuc, F. (2010) High Volume Textual Processing (in Romanian), graduation thesis, "Alexandru Ioan Cuza" University, Faculty of Computer Science, Iasi.

Simionescu, R., Cristea, D. (2011) Help-desk and registry, CLARIN report M6C-3.3.

Trandabat, D. (2010) Natural Language Processing Using Semantic Frames, Ph.D. thesis, "Alexandru Ioan Cuza" University, Faculty of Computer Science, Iasi.

Tufis, D., Barbu, E., Barbu-Mititelu, V., Ion, R., and Bozianu, L. (2004) The Romanian Wordnet. In Dan Tufis (ed.), Romanian Journal on Information Science and Technology, Special Issue on BalkaNet, volume 7. Romanian Academy.

Dan Cristea
Received October 13, 2011

Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iasi
Institute of Computer Science, Romanian Academy, the Iasi branch
E-mail: dcristea@info.uaic.ro